Abstract. Test set with redundancy is one of the focuses in recent
bioinformatics research. The set cover greedy algorithm (SGA for short) is a
commonly used algorithm for test set with redundancy. This paper proves that
the approximation ratio of SGA can be
$(2 - \frac{1}{2r}) \ln n + \frac{3}{2} \ln r + O(\ln\ln n)$ by using the
potential function technique. This result is better than the approximation
ratio $2 \ln n$, which derives directly from set multicover, when
$r = o(\frac{\ln n}{\ln\ln n})$, and it extends the approximability results
for plain test set.



1 Preliminaries 
1.1 Test Set Problems 

Test set problems arise in pattern recognition, machine learning, and
bioinformatics. Test set is NP-hard. The algorithms used in practice include
simple "greedy" algorithms, branch and bound, and Lagrangian relaxation. The
"greedy" algorithms can be implemented by the set cover criterion or by the
information criterion, and the average performances of the two types of
"greedy" algorithms are virtually the same in practice [1]. Test set is not
approximable within $(1 - \varepsilon) \ln n$ for any $\varepsilon > 0$ unless
$NP \subseteq DTIME(n^{\log\log n})$.

Recently, the precise worst-case analysis of the two types of "greedy"
algorithms has been accomplished. The authors of [3] designed a new
information-type algorithm, the information content heuristic (ICH for short),
and proved its approximation ratio $\ln n + 1$, which almost matches the
inapproximability result. The author of [4] proved that the approximation
ratio of the set cover greedy algorithm (SGA for short) can be $1.14 \ln n$,
and showed a lower bound of $1.0007 \ln n$ on the approximation ratio of this
algorithm.

Test set with redundancy, which can be regarded as a special case of set
multicover^1, captures the requirement of redundant distinguishability in the
string barcoding problem [5] and the minimum cost probe set problem [6] in
bioinformatics.



^1 This paper considers the case in which each subset can be selected only
once, which is called constrained set multicover in: Vazirani V V.
Approximation Algorithms. Springer, 2001. 108-118.



The input of test set with redundancy $r \in Z^+$ consists of a set of items
$S$ with $|S| = n$ and a collection $\mathcal{T}$ of subsets of $S$ (called
tests). An item pair is a set of two different items. A test $T$
differentiates an item pair $a$ if $|T \cap a| = 1$. A family of tests
$\mathcal{T}' \subseteq \mathcal{T}$ is an r-test set of $S$ if each item pair
is differentiated by at least $r$ different tests in $\mathcal{T}'$. The
objective is to find an r-test set of minimum cardinality. A 1-test set is
simply called a test set.

Definition 1 (Test Set with Redundancy r).^2
Input: $S$, $\mathcal{T}$;
Feasible Solution: r-test set $\mathcal{T}'$, $\mathcal{T}' \subseteq \mathcal{T}$;
Measure: $|\mathcal{T}'|$;
Goal: minimize.

We use $a \perp T$ to indicate that $T$ differentiates $a$, and use
$\perp(a, \mathcal{T})$ to denote the number of tests in $\mathcal{T}$ that
differentiate $a$. We state the following two facts without proof. If
$\mathcal{T}$ is an r-test set, then $|\mathcal{T}| \ge \log_2 n$. If
$\mathcal{T}$ is a minimal r-test set, then $|\mathcal{T}| \le r(n - 1)$.
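The definitions above can be made concrete with a small sketch. It is only an illustration under assumptions: items are integers, tests are Python frozensets, and the helper names (`differentiates`, `diff_count`, `is_r_test_set`) and the toy instance are hypothetical, not taken from the paper.

```python
from itertools import combinations

def differentiates(test, pair):
    """A test differentiates a pair iff it contains exactly one of its two items."""
    return len(test & pair) == 1

def diff_count(pair, tests):
    """Number of tests in `tests` that differentiate `pair`."""
    return sum(1 for t in tests if differentiates(t, pair))

def is_r_test_set(items, tests, r):
    """Check that every item pair is differentiated by at least r tests."""
    pairs = (frozenset(p) for p in combinations(sorted(items), 2))
    return all(diff_count(p, tests) >= r for p in pairs)

# Toy instance (hypothetical): 4 items, 4 tests given as subsets of the items.
items = {0, 1, 2, 3}
tests = [frozenset(s) for s in ({0, 1}, {0, 2}, {0, 3}, {1, 2})]
print(is_r_test_set(items, tests, 2))  # every pair is split at least twice
print(is_r_test_set(items, tests, 3))  # some pairs are split only twice
```

In this toy instance the pairs {0, 3} and {1, 2} are differentiated by exactly two tests, so the family is a 2-test set but not a 3-test set.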

1.2 Set Cover Greedy Algorithm 

Test set with redundancy can be reduced to set multicover in a natural way.
Given an instance $(S, \mathcal{T})$ of test set with redundancy $r$, we
construct an instance $(U, \mathcal{C})$ of set multicover with coverage
requirement $r$, with $U = \{\{i, j\} \mid i, j \in S, i \ne j\}$, and

$\mathcal{C} = \{c(T) \mid T \in \mathcal{T}\}$, $c(T) = \{\{i, j\} \mid i \in T, j \in S - T\}$.

Clearly, $\mathcal{T}'$ is an r-test set iff $\mathcal{C}' = \{c(T) \mid T \in \mathcal{T}'\}$ is an r-set cover of $U$.
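The reduction can be sketched in a few lines. This is a hedged illustration, assuming tests are frozensets contained in the item set; the names `to_multicover` and `is_r_cover` are hypothetical helpers introduced here.

```python
from itertools import combinations

def to_multicover(items, tests):
    """Build (U, C): U holds all item pairs, c(T) holds the pairs split by T."""
    items = set(items)
    universe = {frozenset(p) for p in combinations(sorted(items), 2)}
    cover_sets = [
        {frozenset((i, j)) for i in t for j in items - t}
        for t in tests
    ]
    return universe, cover_sets

def is_r_cover(universe, cover_sets, r):
    """Every element of U must lie in at least r of the given sets."""
    return all(sum(1 for c in cover_sets if e in c) >= r for e in universe)

# Toy instance (hypothetical).
items = {0, 1, 2, 3}
tests = [frozenset(s) for s in ({0, 1}, {0, 2}, {0, 3}, {1, 2})]
universe, cover_sets = to_multicover(items, tests)
print(len(universe), [len(c) for c in cover_sets])
print(is_r_cover(universe, cover_sets, 2))
```

Each test with two of the four items splits 2 x 2 = 4 pairs, so every set c(T) in this instance has four elements.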

SGA runs the same way as the greedy algorithm for set multicover. We say an
item pair $a$ is alive if it is differentiated by fewer than $r$ picked tests.
In each iteration, the algorithm picks, from the currently unpicked tests, the
test that differentiates the most alive item pairs. Formally, SGA can be
described as:

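Algorithm. SGA
Input: $S$, $\mathcal{T}$;
Output: an r-test set of $S$;
begin
    $\bar{\mathcal{T}} \leftarrow \emptyset$;
    while $\#(\bar{\mathcal{T}}) > 0$ do
        select $T$ in $\mathcal{T} - \bar{\mathcal{T}}$ minimizing $\#(\bar{\mathcal{T}} \cup \{T\})$;
        $\bar{\mathcal{T}} \leftarrow \bar{\mathcal{T}} \cup \{T\}$;
    endwhile
    return $\bar{\mathcal{T}}$.
end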
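A minimal runnable sketch of the procedure above, assuming tests are frozensets of items and recomputing the measure from scratch at every step (simple, not efficient). The function names `measure` and `sga` and the toy instance are illustrative assumptions, not part of the paper.

```python
from itertools import combinations

def measure(pairs, picked, r):
    """Differentiation measure: total unmet requirement over all item pairs."""
    return sum(
        max(r - sum(1 for t in picked if len(t & a) == 1), 0)
        for a in pairs
    )

def sga(items, tests, r):
    """Greedy: repeatedly pick the unpicked test that minimizes the measure."""
    pairs = [frozenset(p) for p in combinations(sorted(items), 2)]
    picked, remaining = [], list(tests)
    while measure(pairs, picked, r) > 0:
        if not remaining:
            raise ValueError("the given tests do not contain an r-test set")
        # Ties are broken by list order (an arbitrary choice of this sketch).
        best = min(remaining, key=lambda t: measure(pairs, picked + [t], r))
        picked.append(best)
        remaining.remove(best)
    return picked

# Toy instance (hypothetical).
items = {0, 1, 2, 3}
tests = [frozenset(s) for s in ({0, 1}, {0, 2}, {0, 3}, {1, 2})]
solution = sga(items, tests, 2)
print(len(solution))  # 3 for this toy instance
```

Minimizing the measure after adding a test is the same as maximizing the number of alive pairs that the test differentiates, which is how the text describes the greedy choice.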



^2 In Definition 1, we suppose there are no two tests $T_1$ and $T_2$ satisfying $T_1 = S - T_2$.



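Definition 2 (Partial r-Test Set and Differentiation Measure).

We call $\bar{\mathcal{T}}$ the partial r-test set. The differentiation
measure of $\bar{\mathcal{T}}$ is defined as
$\#(\bar{\mathcal{T}}) = \sum_a \max\{r - \perp(a, \bar{\mathcal{T}}), 0\}$.
The differentiation measure of $T$ related to $\bar{\mathcal{T}}$ is defined as
$\#(T, \bar{\mathcal{T}}) = \#(\bar{\mathcal{T}}) - \#(\bar{\mathcal{T}} \cup \{T\})$.
Denote $\#_0 = \#(\emptyset) = r n(n-1)/2$.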
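As a quick sanity check of these definitions, consider a hypothetical instance (not from the paper) with $n = 4$ items, $r = 2$, and a single test $T$ that contains exactly two of the four items:

```latex
\begin{align*}
\#_0 &= \#(\emptyset) = r \cdot \frac{n(n-1)}{2} = 2 \cdot 6 = 12, \\
\#(T, \emptyset) &= |\{a : |T \cap a| = 1\}| = 2 \cdot 2 = 4, \\
\#(\{T\}) &= \#_0 - \#(T, \emptyset) = 12 - 4 = 8.
\end{align*}
```

Here every pair is still alive before $T$ is picked, so the marginal measure of $T$ is simply the number of pairs it splits.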

The greedy algorithm for set multicover has approximation ratio where 
$N = |U|$. Using the natural reduction, we immediately obtain the
approximation ratio $2 \ln n$ of SGA. Using the standard multiplicative
weights argument, we can obtain another approximation ratio
$\ln \#_0 - \ln m^* + 1$ of SGA, where $m^*$ is the size of the optimal r-test
set (see Lemma 19 in [5]).

The authors of [H] designed a randomized multi-step rounding algorithm 
(RND for short) for set multicover, and the expected approximation ratio is
approximately no more than $\ln N - \ln r$. Experiments on test set show that
when $r$ is small, SGA performs better than RND, and when $r$ is close to or
larger than $n$, RND performs better than SGA [5].

1.3 Our Method and Result 

In [10], Young introduces the "oblivious rounding" technique to give another
proof of the well-known approximation ratio $\ln n + 1$ of the greedy
algorithm for set cover. He observes that the number of uncovered elements is
a "potential function" and that the approximation algorithm only needs to
drive down the potential function at each step.
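For orientation, the standard potential argument for plain set cover can be written out as follows. This is a textbook-style sketch in notation assumed here: $u_t$ is the number of uncovered elements after $t$ greedy steps, $n$ is the number of elements, and $m^*$ is the size of an optimal cover.

```latex
\begin{align*}
u_{t+1} &\le u_t \left(1 - \frac{1}{m^*}\right)
  && \text{(some optimal set covers at least } u_t / m^* \text{ uncovered elements)} \\
u_t &\le n \left(1 - \frac{1}{m^*}\right)^{t} < n\, e^{-t/m^*}
  && \text{(for } t \ge 1\text{)} \\
u_t &< 1 \quad \text{once } t \ge m^* \ln n,
  && \text{so greedy stops within } \lceil m^* \ln n \rceil \le (\ln n + 1)\, m^* \text{ steps.}
\end{align*}
```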

Arora et al. present a simple meta algorithm that unifies many disparate
algorithms and derive them as instantiations of the meta algorithm [11]. They
call the meta algorithm the multiplicative weights method, and suggest that it
be viewed as a basic tool for designing algorithms.

This paper proves that the approximation ratio of SGA can be
$(2 - \frac{1}{2r}) \ln n + \frac{3}{2} \ln r + O(\ln\ln n)$ by applying the
potential function technique. This result is better than the approximation
ratio $2 \ln n$, which derives directly from set multicover, when
$r = o(\frac{\ln n}{\ln\ln n})$, and it extends the approximability results
for plain test set. The analysis of this algorithm fits in the framework of
the multiplicative weights method.

In Section 2, the authors analyze the phenomenon of "differentiation
repetition" of test set with redundancy and apply the potential function
technique to prove an improved approximation ratio of SGA. Section 3 gives
some discussion.

2 Proof of Our Result 
2.1 Differentiation Repetition 

Practitioners of test set problems are aware of the phenomenon that item pairs
tend to be differentiated more times than required. In other words, item pairs
that are differentiated only a small number of times are quite "sparse",
especially when $m^*$ is small.



The author of [4] investigates this characteristic of test set
quantitatively. He analyzes the distribution of the number of times item pairs
are differentiated, especially the relationship between the differentiation
distribution and the size of the optimal test set. The following lemma on test
set with redundancy can be obtained as a corollary.

Lemma 1. Let $\mathcal{T}^*$ be an optimal r-test set, and
$m^* = |\mathcal{T}^*|$; then at most $2 n \log_2 n \cdot (m^*)^{r-1}$ item
pairs are differentiated by exactly $r$ tests in $\mathcal{T}^*$.



2.2 Improved Approximation Ratio 

In this subsection, the authors apply the potential function technique to
prove an improved approximation ratio of SGA. We note that the decrease of the
potential function can be "accelerated" in the beginning phase of SGA. Our
proof balances the potential function by appending a negative term to the
differentiation measure.

Lemma 2. Given an instance $(S, \mathcal{T})$ of test set with redundancy $r$,
let $\mathcal{T}^*$ be an optimal r-test set, $m^* = |\mathcal{T}^*|$, and let
$\#_B$ be the number of item pairs differentiated by exactly $r$ tests in
$\mathcal{T}^*$. Then the size of the solution returned by SGA is no more than
$(\ln \#_0 - \frac{1}{r+1} \ln \frac{\#_0}{\#_B} + \frac{r}{r+1} \ln(r+1) + 1) m^* + 1$.

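Proof. Clearly, there is a partial r-test set $\mathcal{T}_1$ such that
$\#(\mathcal{T}_1) \ge \#_B$, but after selecting the next test $\tilde{T}$,
$\#(\mathcal{T}_1 \cup \{\tilde{T}\}) < \#_B$. Let $\mathcal{T}_2$ be the set
of tests selected after $\tilde{T}$ until the algorithm stops. Then the
returned r-test set is
$\mathcal{T}' = \mathcal{T}_1 \cup \{\tilde{T}\} \cup \mathcal{T}_2$. Let
$k = \frac{r}{r+1} m^* \ln \frac{(r+1)\#_0}{\#_B}$.

Define the potential function as

$$f(\bar{\mathcal{T}}) = \Big(\#(\bar{\mathcal{T}}) - \frac{r}{r+1}\#_B\Big)\Big(1 - \frac{r+1}{r m^*}\Big)^{k - |\bar{\mathcal{T}}|}.$$

Then

$$f(\emptyset) = \Big(\#_0 - \frac{r}{r+1}\#_B\Big)\Big(1 - \frac{r+1}{r m^*}\Big)^{k} < \#_0 \exp\Big(-\frac{r+1}{r m^*} k\Big) = \#_0 \cdot \frac{\#_B}{(r+1)\#_0} = \frac{\#_B}{r+1}.$$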

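Given $\bar{\mathcal{T}}$, let $p$ denote the probability distribution on
tests in $\mathcal{T}^* - \bar{\mathcal{T}}$ that draws one test uniformly
from $\mathcal{T}^* - \bar{\mathcal{T}}$. For any
$T \in \mathcal{T}^* - \bar{\mathcal{T}}$, the probability of drawing $T$ is
$p(T) = \frac{1}{|\mathcal{T}^* - \bar{\mathcal{T}}|}$.

For any item pair $a$,

$$\sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}:\ a \perp T} p(T) = \frac{\perp(a, \mathcal{T}^* - \bar{\mathcal{T}})}{|\mathcal{T}^* - \bar{\mathcal{T}}|} \ge \frac{\perp(a, \mathcal{T}^*) - \perp(a, \bar{\mathcal{T}})}{|\mathcal{T}^* - \bar{\mathcal{T}}|}.$$

Since $\perp(a, \mathcal{T}^*) \ge r$ and $|\mathcal{T}^* - \bar{\mathcal{T}}| \le m^*$, for every alive item pair $a$,

$$\sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}:\ a \perp T} p(T) \ge \frac{r - \perp(a, \bar{\mathcal{T}})}{m^*}.$$

If $\perp(a, \mathcal{T}^*) \ge r + 1$,

$$\sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}:\ a \perp T} p(T) \ge \frac{r - \perp(a, \bar{\mathcal{T}}) + 1}{m^*}.$$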

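By the definition of $f(\bar{\mathcal{T}})$ and the facts $p(T) \ge 0$ and
$\sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}} p(T) = 1$,

$$\begin{aligned}
\min_{T \in \mathcal{T} - \bar{\mathcal{T}}} f(\bar{\mathcal{T}} \cup \{T\})
&\le \min_{T \in \mathcal{T}^* - \bar{\mathcal{T}}} f(\bar{\mathcal{T}} \cup \{T\}) \\
&\le \sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}} p(T)\, f(\bar{\mathcal{T}} \cup \{T\}) \\
&= \Big(\#(\bar{\mathcal{T}}) - \frac{r}{r+1}\#_B - \sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}} p(T)\, \#(T, \bar{\mathcal{T}})\Big)\Big(1 - \frac{r+1}{r m^*}\Big)^{k - |\bar{\mathcal{T}}| - 1},
\end{aligned}$$

and

$$\begin{aligned}
\sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}} p(T)\, \#(T, \bar{\mathcal{T}})
&= \sum_{\text{alive } a}\ \sum_{T \in \mathcal{T}^* - \bar{\mathcal{T}}:\ a \perp T} p(T) \\
&\ge \sum_{\text{alive } a:\ \perp(a, \mathcal{T}^*) = r} \frac{r - \perp(a, \bar{\mathcal{T}})}{m^*} + \sum_{\text{alive } a:\ \perp(a, \mathcal{T}^*) \ge r+1} \frac{r - \perp(a, \bar{\mathcal{T}}) + 1}{m^*} \\
&\ge \sum_{\text{alive } a} \frac{r - \perp(a, \bar{\mathcal{T}}) + 1}{m^*} - \sum_{a:\ \perp(a, \mathcal{T}^*) = r} \frac{1}{m^*} \\
&\ge \frac{r+1}{r m^*} \sum_{\text{alive } a} \big(r - \perp(a, \bar{\mathcal{T}})\big) - \frac{\#_B}{m^*} \\
&= \frac{r+1}{r m^*}\Big(\#(\bar{\mathcal{T}}) - \frac{r}{r+1}\#_B\Big).
\end{aligned}$$

The last inequality uses $r - \perp(a, \bar{\mathcal{T}}) \le r$ for every
alive pair $a$, which gives
$r - \perp(a, \bar{\mathcal{T}}) + 1 \ge \frac{r+1}{r}\big(r - \perp(a, \bar{\mathcal{T}})\big)$.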



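Hence

$$\min_{T \in \mathcal{T} - \bar{\mathcal{T}}} f(\bar{\mathcal{T}} \cup \{T\}) \le \Big(\#(\bar{\mathcal{T}}) - \frac{r}{r+1}\#_B\Big)\Big(1 - \frac{r+1}{r m^*}\Big)\Big(1 - \frac{r+1}{r m^*}\Big)^{k - |\bar{\mathcal{T}}| - 1} = f(\bar{\mathcal{T}}).$$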

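For a partial r-test set $\bar{\mathcal{T}}$, the algorithm selects $T$ in
$\mathcal{T} - \bar{\mathcal{T}}$ to minimize $f(\bar{\mathcal{T}} \cup \{T\})$.
Therefore, $f(\mathcal{T}_1) \le f(\emptyset) < \frac{\#_B}{r+1}$.
By the definition of $\mathcal{T}_1$,

$$f(\mathcal{T}_1) \ge \Big(\#_B - \frac{r}{r+1}\#_B\Big)\Big(1 - \frac{r+1}{r m^*}\Big)^{k - |\mathcal{T}_1|} = \frac{\#_B}{r+1}\Big(1 - \frac{r+1}{r m^*}\Big)^{k - |\mathcal{T}_1|}.$$

Therefore, $(1 - \frac{r+1}{r m^*})^{k - |\mathcal{T}_1|} < 1$, and $|\mathcal{T}_1| < k$.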



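We can easily prove $|\mathcal{T}_2| \le (\ln \#_B + 1) m^*$ by the natural
reduction to set multicover. When the algorithm stops, the size of the
returned solution is

$$|\mathcal{T}'| = |\mathcal{T}_1| + |\mathcal{T}_2| + 1 \le \Big(\ln \#_0 - \frac{1}{r+1} \ln \frac{\#_0}{\#_B} + \frac{r}{r+1} \ln(r+1) + 1\Big) m^* + 1.$$

□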
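The averaging step at the heart of the proof can be checked numerically on a small instance. The sketch below is an illustration under assumptions: it uses exact rational arithmetic, the helper names `count` and `averaging_step` are hypothetical, and `star` only needs to be an r-test set for the inequality to be meaningful (the proof takes it to be optimal). The left-hand side is the expected marginal measure of a test drawn uniformly from the unpicked tests of `star`; the right-hand side is $\frac{r+1}{r m^*}(\#(\bar{\mathcal{T}}) - \frac{r}{r+1}\#_B)$.

```python
from fractions import Fraction
from itertools import combinations

def count(pair, tests):
    """Number of tests in `tests` that differentiate `pair`."""
    return sum(1 for t in tests if len(t & pair) == 1)

def averaging_step(items, star, picked, r):
    """Return (lhs, rhs) of the averaging inequality for the partial set `picked`."""
    pairs = [frozenset(p) for p in combinations(sorted(items), 2)]
    m_star = len(star)
    n_b = sum(1 for a in pairs if count(a, star) == r)      # plays the role of #_B
    alive = [a for a in pairs if count(a, picked) < r]
    measure = sum(r - count(a, picked) for a in alive)      # #(picked)
    rest = [t for t in star if t not in picked]             # star minus picked
    gain = sum(count(a, [t]) for t in rest for a in alive)  # total marginal measure
    lhs = Fraction(gain, len(rest))
    rhs = Fraction(r + 1, r * m_star) * (measure - Fraction(r, r + 1) * n_b)
    return lhs, rhs

# Toy instance (hypothetical): `star` is a 2-test set of size 3 on 4 items.
items = {0, 1, 2, 3}
star = [frozenset(s) for s in ({0, 1}, {0, 2}, {0, 3})]
for picked in ([], [frozenset({1, 2})], [frozenset({0, 1})]):
    lhs, rhs = averaging_step(items, star, picked, 2)
    print(len(picked), lhs, rhs, lhs >= rhs)
```

On this instance every pair is differentiated exactly twice by `star`, so the count playing the role of $\#_B$ is 6 and the inequality is tight for the empty partial set.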

Theorem 1. The approximation ratio of SGA for test set with redundancy $r$
can be $(2 - \frac{1}{2r}) \ln n + \frac{3}{2} \ln r + O(\ln\ln n)$.

Proof. Let $\rho_1 = \ln \#_0 - \ln m^* + 1$, and
$\rho_2 = \ln \#_0 - \frac{1}{r+1} \ln \frac{\#_0}{2 n \log_2 n \cdot (m^*)^{r-1}} + \ln(r+1) + 1$.
Then $\rho_1$ is an upper bound of the approximation ratio ([5]), and $\rho_2$
is also an upper bound of the approximation ratio by Lemma 1 and Lemma 2.

For fixed $r$ and $n$, $\rho_1$ is a decreasing function of $m^*$, and
$\rho_2$ is a nondecreasing function of $m^*$. Hence $\min(\rho_1, \rho_2)$ is
maximized when $\rho_1 = \rho_2$. This leads to
$\ln m^* = \frac{1}{2r} \ln n - \frac{1}{2} \ln r - O(\ln\ln n)$, which implies
$\min(\rho_1, \rho_2) \le (2 - \frac{1}{2r}) \ln n + \frac{3}{2} \ln r + O(\ln\ln n)$. □
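The balancing step can also be evaluated numerically. The following sketch is an illustration only: it treats $\ln m^*$ as a continuous variable, restricts $m^*$ to the range $[\log_2 n, r(n-1)]$ given by the two facts in Section 1.1, and evaluates $\rho_1$ and $\rho_2$ exactly as written in the proof (so lower-order terms are computed rather than hidden). The function name `sga_ratio_bound` is hypothetical.

```python
import math

def sga_ratio_bound(n, r):
    """Worst case over m* of min(rho1, rho2), with x = ln m* treated as continuous."""
    ln0 = math.log(r * n * (n - 1) / 2)            # ln #_0
    ln_cap = math.log(2 * n * math.log2(n))        # ln(2 n log2 n)

    def rho1(x):
        return ln0 - x + 1

    def rho2(x):
        return ln0 - (ln0 - ln_cap - (r - 1) * x) / (r + 1) + math.log(r + 1) + 1

    # Solving rho1(x) = rho2(x) for x gives the balance point.
    x_star = (ln0 - ln_cap - (r + 1) * math.log(r + 1)) / (2 * r)
    # Clamp to the admissible range of ln m*.
    lo, hi = math.log(math.log2(n)), math.log(r * (n - 1))
    x = min(max(x_star, lo), hi)
    return min(rho1(x), rho2(x))

n = 10**6
for r in (1, 2, 3):
    print(r, round(sga_ratio_bound(n, r), 2), round(2 * math.log(n), 2))
```

Because $\rho_1$ decreases and $\rho_2$ does not decrease in $m^*$, the maximum of their minimum over the admissible range is attained at the balance point, or at the nearest end of the range when the balance point falls outside it.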

3 Discussions 

In this paper, the authors show a new approximability result for test set
with small redundancy, which is better than the approximation ratio that
derives directly from set multicover. It seems that ICH cannot be generalized
to test set with redundancy $r > 1$. This raises an interesting question: can
the approximation ratio of test set with redundancy be pushed to the matching
bound $\ln n + 1$ of plain test set?